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Abstract — Gene regulation in higher eukaryotes involves a 
complex interplay between the gene proximal promoter and 
distal genomic elements (such as enhancers) which work in 
concert to drive spatio-temporal expression. The experimental 
characterization of gene regulatory elements is a very complex 
and resource-intensive process. One of the major goals in 
computational biology is the in-silico annotation of previously 
uncharacterized elements using results from the subset of known, 
annotated, regulatory elements. 

The recent results of the ENCODE project presented in-depth 
experimental analysis of such functional (regulatory) non-coding 
elements for 1% of the human genome. This dataset enables 
the principled association of experimental results with true 
regulatory activity from in-vitro or in-vivo studies. It is hoped that 
the results obtained on this subset can be scaled to the rest of the 
genome. This is an extremely important effort which will enable 
faster dissection of other functional elements in key biological 
processes such as disease progression and organ development. 
The computational annotation of these hitherto uncharacterized 
regions would require an identification of features that have good 
predictive value for regulatory behavior. 

Though the exact mechanism of gene regulation is not com- 
pletely known, several data-driven experimental models have 
been hypothesized to understand transcription, pointing to se- 
quence, expression, transcription factor (TF) and their interac- 
tome level attributes, at the biochemical and biophysical levels. 
This has largely been possible due to the advent of new techniques 
in functional genomics, such as TF chromatin immunoprecipita- 
tion (ChIP), RNA interference, microarray expression profiling, 
yeast-2-hybrid (Y2H) screens and chromosome conformation 
capture studies. However, these features are yet to be mean- 
ingfully integrated for understanding transcriptional regulatory 
mechanisms computationally. It is believed that such data-driven 
computational models can be extremely useful to the discovery 
of new regulatory elements of desired function and specificity. 

In this work, we study transcriptional regulation as a problem 
in heterogeneous data integration, across sequence, expression 
and interactome level attributes. Using the example of the Gata2 
gene and its recently discovered urogenital enhancers [33] as 
a case study, we examine the predictive value of various high 
throughput functional genomic assays in characterizing these 
enhancers and their regulatory role. Observing results from the 
application of modern statistical learning methodologies for each 
of these data modalities, we propose a set of attributes that are 
most discriminatory in the localization and behavior of these 
enhancers. 
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Index Terms — Nephrogenesis, Random Forests, Transcrip- 
tional regulation, Transcription factor binding sites (TFBS), 
GATA genes, comparative genomics, functional genomics, tissue- 
specific genes, network analysis, directed information, heteroge- 
neous data integration. 

I. Introduction 

Understanding the mechanisms underlying regulation of 
tissue-specific gene expression remains a challenging question. 
While all mature cells in the body have a complete copy 
of the human genome, each cell type only expresses those 
genes it needs to carry out its assigned task. This includes 
genes required for basic cellular maintenance (often called 
"house-keeping genes") and those genes whose function is 
specific to the particular tissue type that the cell belongs to. 
Gene expression by way of transcription is the process of 
generation of messenger RNA (mRNA) from the DNA tem- 
plate representing the gene. It is the intermediate step before 
the generation of functional protein from messenger RNA. 
During gene expression, transcription factor (TF) proteins are 
recruited at the proximal promoter of the gene as well as at 
sequence elements (enhancers/silencers) which can lie several 
hundreds of kilobases from the gene's transcriptional start site 
(Figs. Q] and |2). 



TATA box 




Fig. 1. Schematic of Transcriptional Regulation. Sequence motifs at the 
promoter and the distal regulatory elements together confer specificity of gene 
expression via TF binding. 

It is hypothesized that the collective set of transcription 
factors that drive (regulate) expression of a target gene are cell, 
context and tissue dependent ([55], [70]). Some of these TFs 
are recruited at proximal regions such as the promoter of the 
gene, while others are recruited at more distal regions, such 
as enhancers. There are several (hypothesized) mechanisms 
for promoter-enhancer interaction through protein interactions 
between TFs recruited at these elements during formation 
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of the transcriptional complex [54]. A commonly accepted 
mechanism of distal interaction, during regulation, is looping 
([66], [21], [44]), shown in Fig. 

To understand the role of various genomic elements in 
governing gene regulation, functional genomics has played 
an enabling role in providing heterogeneous data sources 
and experimental approaches to discern distal interactions 
during transcription. Each of these experiments have aimed to 
resolve different aspects (features) of transcriptional regulation 
focussing on TF binding, promoter modeling and epigenetic 
preferences for tissue-specific expression in some genomic 
regulatory elements ([17], [29], [36], [64], [44]). Additionally, 
some studies have demonstrated that these data sets along 
with principled statistical metrics can be used to derive such 
features computationally, with a view to asking questions 
relevant to the biology of transcriptional regulation ([29], [64], 
[49], [37]). 

There have been several principled yet scattered studies 
characterizing the role of functional regulatory elements for 
certain genes (such as Mecp2, Shh, Gata2, Gata3) in various 
organisms ([44], [40], [33], [39], [27], [52]). These reveal an 
inherent spatio-temporal context of gene expression and regu- 
lation. However, there is a need for a unified set of principles, 
spanning various genomic attributes, that could account for the 
behavior of these tissue-specific and gene-specific enhancers. 
We note that there are promoter-independent enhancers too, 
and their computational study has been far more principled 
([55], [56]); however, their study is outside the scope of this 
study where we focus on gene-specificity in addition to tissue- 
specificity. 

The results of the ENCODE project 
{http://encode.nih.gov/), ([17], [36]) on 1% of the human 
genome has established some very interesting results about 
the nature of transcriptional regulation at the genome scale. 
Particularly, they report the use of several experimental 
techniques (Histone ChIP on chip, DNASE1 hypersensitivity 
assays) etc analyzing transcribed regions as well as their 
regulatory regions, genome-wide. A large scale computational 
effort is developing alongside to "learn" features of such 
regulatory elements and use of these features for predicting 
other control elements for genes outside the ENCODE 
regions, thereby accomplishing a genome-wide annotation. 
Considering that over 98% of the genome is non-coding, 
this effort is going to parallel the previous project in gene- 
annotation at the genome scale in effort and importance. 
Adding to this complexity is the fact that the same non-coding 
element can potentially regulate the expression of genes in a 
spatio-temporal manner, activating different genes at different 
times in different tissues, and from arbitrarily large distances 
from the gene. Thus there is a need for the principled 
"reverse-engineering" of the architectures of these regulatory 
elements, using features at the sequence, expression and 
interactome level. 

Understanding the mechanism of transcriptional regulation 
thus entails several aspects: 

1 ) Do regulatory regions like promoters and enhancers have 
any interesting sequence properties depending on their 
tissue-specificity of gene expression? These properties 



can be examined based on their individual sequences 
or their epigenetic preferences. A common technique 
of analysis is the identification of tissue-specific motif- 
signatures ([45], [38]) for such elements. 
2) To reduce the vast number of false positives that arise 
from sequence approaches alone, we appeal to a mech- 
anistic insight from biology. For long-range transcrip- 
tional regulation to be enabled, there has to be an 
enhancer-promoter interaction during formation of the 
tissue-specific, gene-specific transcriptional machinery. 
Literature suggests that such interaction is mediated 
by protein-protein interactions between promoter TFs 
and enhancer TFs after looping along the chromosomal 
length ([44], [4], [26], [66]). This insight (Fig. El poses 
two further questions: 

• Which TFs bind the promoter and the putative 
enhancer? 

• Do the resultant interactions between enhancer and 
promoter TFs have any special characteristic that 
discriminate functional non-coding regulatory re- 
gions from non-functional ones? 

As a case study to answer some of these questions, we 
examine the regulation of Gata2 regulation in the developing 
kidney. Gata2 is a gene belonging to the GATA family of 
transcription factors (GATA1-6), and binds the consensus - 
WGATAR- motif on DNA [51]. It is located on mouse 
chromosome 6, and plays an important role in mammalian 
hematopoiesis, nephrogenesis and CNS development, with 
important phenotypic consequences. The study of long-range 
regulatory elements that effect Gata2 expression has been 
on for several years now. The most common strategy for 
identifying possible regulatory elements has hitherto been 
inter-species conservation studies. Using this approach, all 
elements flanking the gene that are conserved more than 
some threshold are retained for further experimental charac- 
terization. The reason underlying this strategy is that truly 
functional elements are under evolutionary pressure to retain 
their function across species. Given the technical complexity of 
associated transgenic experiments, this turns out to be a fairly 
inefficient strategy, especially since the number of candidate 
regulatory elements increases as larger genomic regions are 
examined (to account for distal regulation). Such a scenario 
prompts the need for an integrative strategy to reduce the 
number of candidates obtained from a purely conservation- 
based search strategy using other, complementary genomic 
modalities. 

Recently, [33] reported the characterization of two enhancer 
elements, conferring urogenital-specific expression of Gata2, 
between 80 — 150kb away from the gene locus, on chro- 
mosome 6. In this work, we examine if genomic features, 
other than sequence identity, are predictive of the location 
of these elements. These feature span sequence, expression 
and interactome perspectives for such regulatory elements. We 
will also attempt to motivate the utility of these approaches 
(metrics and data sources) as well as their biological relevance 
alongside (how they fit into the biophysics of transcriptional 
regulation). It must be pointed out that there is large paucity in 
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data availability, in that data specific to the developing kidney 
is hard to come by. Under this constraint, we have made some 
biologically plausible assumptions so as to obtain maximum 
information from currently available data sources. 

II. Rationale and Data Sources: 

The overall schematic of distal transcriptional regulation 
via looping is given in Fig. [2] This schematic suggests the 
decomposition of the regulatory process along three main 
modalities: sequence, expression and interactome. Our main 
goal in this paper is to understand urogenital (kidney) enhancer 
behavior from these three perspectives. These attributes are 
discussed below: 

Fig. 2. Distal enhancer-promoter interaction via looping is mediated via 
protein interactions during TF complex formation. The set of TFs that are 
putatively recruited at the proximal promoter and distal enhancer can be 
found from sequence and expression data [59]. Evidence of protein-interaction 
between proximal and distal TFs can be found from interaction databases. 

1) Sequence Perspective: To build motif signatures under- 
lying kidney-specific enhancer activity, it would be best 
to have a database of previously characterized urogenital 
(UG) enhancers. However, due to the unavailability of 
such data, we utilize kidney-specific promoter sequences 
and histone-modified sequences of enhancers to find 
motif-signatures of regulatory elements that are poten- 
tially UG enhancers. 
< Promoters of kidney-specific genes: A catalog of 
kidney-specific mouse promoters is available from 
the GNF Symatlas {http://symatlas.gnf.org/). This 
database contains list of annotated genes and 
their expression in several tissue types, includ- 
ing the kidney. Since the proximal promoter of 
such kidney-specific genes harbors the transcrip- 
tional machinery for gene regulation, their se- 
quences putatively have motifs that are associated 
with kidney-specific expression. Additionally, pro- 
moters that are spatio-temporally expressed dur- 
ing kidney development are also analyzed (MGI: 
http://www.informatics.jax.org/). The GNF dataset 
profiles mostly adult tissue-types. Since our goal is 
to study enhancer activity during nephrogenesis, we 
focus on genes expressed between day elO and el2 
in the developing kidney - such a list is obtained 
from the MGI database. 

Without loss of generality, we use six-nucleotide 
motifs (hexamers) as the motifs. This is based on the 
observation that most transcription factor binding 
motifs have a 5 — 6 nucleotide core sequence with 
degeneracy at the ends of the motif. A similar 
setup has been introduced in ([9], [30]). The main 
difference in our approach from such previous work 
is that differential hexamer analysis was done for 
the same class of sequences, and the statistical 
nature of the "test-set" is, by design similar to the 
training set. That is, in [9], differential hexamers 
are found between known Cis-Regulatory Modules 



(CRMs) and non-CRMs, and used for the predic- 
tion of new CRMs from sequence. On the other 
hand, [30] deals with finding hexamer features of 
known promoters and using them to predict new 
promoters from sequence. In our case, however, we 
don't have enhancer data (equivalent to CRMs) and 
we are using promoter-data for the prediction of 
enhancer (CRM) instead. Thus, the nature of the test 
sequence is very different. We demonstrate that our 
approach is partially useful in the discovery of pu- 
tative enhancers from sequence. Also,the presented 
motif-finding approach does not depend on motif 
length and can be scaled depending on biological 
knowledge. 

We set up the motif discovery as a feature extrac- 
tion problem from these tissue-specific promoter 
sequences and then build a random forest (RF) 
classifier to classify new sequences into specific 
and non-specific categories based on these identified 
sequence features (motifs). Based on the motifs 
derived using a RF classifier algorithm we are able 
to accurately classify more than 95% (training- 
error rate) of tissue-specific genes based upon their 
upstream promoter region sequences alone. Since 
promoters are non-coding regulatory regions, the 
derived motifs can be putatively used to find kidney- 
specificity of other non-coding regions genome- 
wide (Section: I Villi ). 
• Chromatin marks in known regulatory elements: 
Using the recently released ENCODE data, 
a catalog of sequences that undergo histone 
modifications such as methylation and acetylation 
is available for analysis [36]. Reports suggest 
that mono-methylation of the lysine residue of 
Histone H'i is associated with enhancer activity 
[29] whereas tri-methylation of H3KA and Hi 
acetylation are associated with promoter activity. 
Using this set of H3K4mel, HSKAmei and H3ac 
sequences, we aim to find sequence motifs that are 
indicative of such epigenetic preferences during 
transcription. Though epigenetic data is available 
for five different cell lines, we choose the HeLa cell 
line data because of its widespread use as a model 
system to understand transcriptional regulation 
in-vitro in the laboratory. Thus, we build a RF 
classifier to discriminate monomethylated H3K4 
sequences from trimethylated iJ3i^4/acetylated 
H3 sequences. We note that this data is not kidney- 
specific, and such data is yet to become available. 
This yields motifs associated with epigenetic 
properties of promoters and enhancers, which are 
potentially predictive of the regulatory potential for 
novel sequences (section: IIXK 

2) Expression Perspective: There is limited expression 
data for the developing mouse kidney, mainly due 
to technical reasons concerning small tissue yield at 
such early time points. For this study, we use mi- 
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croarray expression data from a public repository of 
kidney microarray data {http://genet.chmcc.org [68], 
http://spring.imb.uq.edu.au/ [7]). Each of these sources 
contain expression data profiling kidney development 
from about day 10.5 dpc to the neonate stage. Such 
expression data can be mined for potential regulatory 
influence between upstream TF genes and Gata2. 

< Inference ofTF effectors at the promoter region: The 
TFs putatively recruited at the proximal promoter 
are identified using the Directed Information metric, 
that uses gene-expression level influence in addition 
to phylogenetic conservation of the corresponding 
binding site. We have earlier shown that DTI is a 
good predictor of gene influence and can be used 
to infer transcriptional regulatory networks [59]. A 
more detailed explanation is given in sections: IX-AI 

through EA2] 

< Inference ofTF effectors at each non-coding region: 
At the distal enhancer, it is believed that there is 
recruitment of tissue-specific transcription factors 
that co-operate with the basal transcriptional ma- 
chinery (at the promoter) to direct tissue-specific 
gene expression ([35], [45]). Whereas phylogeny 
and expression-based influence metrics can yield 
high confidence candidates for promoter TFs, a 
similar analysis for enhancers is not possible, be- 
cause of higher order effects ([49], [38]). To this 
end, the only plausible way to search for enhancer 
TFs is to combine phylogeny with tissue-specific 
annotation (from UNIPROT or MGI). Hence, ev- 
ery transcription factor, whose motif is conserved 
at a non-coding (putative enhancer) region and is 
tissue-specific in annotation is considered a likely 
candidate TF at that non-coding region. 

3) Interactome Perspective: The identification of phy- 
logenetically conserved effector TFs at the promoter 
(identified via DTI), as also those that are phylogenet- 
ically conserved at the putative enhancers, lead to the 
exploration of protein-interactions between these TFs, 
during distal enhancer-promoter interaction (SecJXji. The 
STRING database (http://string.embl.de) integrates var- 
ious experimental modalities (genomic context, high- 
throughput experiments such as co-immunoprecipitation, 
co-expression and literature) to maintain a list of 
organism-specific functional protein-association net- 
works that is amenable to such exploration. 

In this work, the above questions will be integratively 
answered for training data as well as in the context of the uro- 
genital enhancers identified in [33]. We aim to show that each 
of these 'features' have a predictive value for the identification 
of enhancers and the integration of these heterogeneous data 
can lead to potential reduction in false positive rate during 
large-scale enhancer discovery, genome-wide. To date, there 
has been no comprehensive study for summarizing these var- 
ious heterogeneous data sources to understand transcriptional 
regulation. 



III. Validation/Biological Application 

As suggested in Sec: |U we use the recently identified Gata2 
urogenital (UG) enhancers to validate our computational ap- 
proach. All the data sources (and their analysis) are therefore 
going to be focused on the kidney. 

The experimental characterization of these enhancers was 
done as follows. Based on BAC transgenic [33] studies, the 
approximate location of the urogenital enhancer(s) of Gata2 
were localized to a 70 kilobase region on chromosome 6. 
Using inter-species conservation plots, four elements were 
selected for transgenic analysis in the mouse. These were 
designated UG1, 2, 3 and 4. After a lengthy and resource- 
intensive experimental effort, two out of these four non-coding 
elements, U G2 and UGA were found to be true UG enhancers. 
Our goal is to find "features" at the sequence, expression and 
interactome level, that are predictive of this reported behavior 
of elements UGl — 4 in the developing kidney. 

It is easy to see the utility of such a methodology, since this 
can be scaled up contextually for other genes of interest. Given 
the complexity of 1% of the genome, made possible by the 
ENCODE project, the search for functional elements genome- 
wide is going to be an important and challenging exercise. 

IV. Organization 

With a view to understanding the elements of transcrip- 
tional regulation, the first part of this paper (Sections IVT - 
lIXI l addresses the problem of identifying motif signatures 
representative of transcriptional control from kidney-specific 
promoters and epigenetically marked sequences. The second 
part of this work (Sections IX-AI - IX-Bt integrates phylogeny 
and expression data to find regulatory TFs at the proximal 
promoter and enhancer(s) of Gata2. Using the notion of TF 
interactions between enhancer and promoter, we examine if 
protein-interaction data (Sec: IX-Cl l can offer supporting evi- 
dence for the observed in-vivo behavior of four putative Gata2 
regulatory elements. Classifiers are designed to discriminate 
regulatory vs. non-regulatory regions based on each of these 
multiple modalities. Finally, a probabilistic combination of 
these classifiers is done to obtain a validation (Sec: IXH of 
the Gata2 UGEs (UGl - 4). Sections: EH and Em] conclude 
the paper. 

V. Sequence Data Extraction and Pre-processing 

The Novartis foundation tissue-specificity atlas 
[http://symatlas.gnf.org/], has a compendium of genes 
and their corresponding tissues of expression. Genes have 
been profiled for expression in about twenty-five tissues, 
including adrenal gland, brain, dorsal root ganglion, spinal 
chord, testis, pancreas, liver etc. Considering these diversity 
of tissue-types, one concern with the interpretation of this 
data is the variability in expression across tissue-types. To 
address this concern, we take a fairly stringent approach 
- if a gene is expressed in less than three tissue types, it 
is annotated tissue-specific ('ts'), and if it is expressed in 
more than 22 tissue types, it is annotated to be non-specific 
('nts'). Based on this assignment, we find a list of 86 genes 
that are tissue-specific as well as have kidney expression 
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(MGI: http://www.infonnatics.jax.org/). For these kidney- 
specific genes, we extract their promoter sequences from 
the ENSEMBL database (http://www.ensembl.org/), using 
sequence 2000bp upstream and lOOObp downstream up to the 
first exon relative to the transcriptional start site reported in 
ENSEMBL (release 37). 

Before proceeding to motif selection, a matrix of motif- 
promoter correspondences is created. In this matrix, the counts 
of hexamer (six-nucleotide) motif occurrence in the 'ts' and 
'nts' promoters is obtained using sequence parsing (R package: 
'seqinr'). The motif length of six is not overly restrictive, since 
it corresponds to the consensus binding site size of several an- 
notated transcription factor motifs in the TRANSFAC/JASPAR 
databases. A Welch t-test is then performed between the rela- 
tive counts of each hexamer in the two expression categories 
('ts' and 'nts') and the top 1000 hexamers with p — value < 
10 -6 are selected. This set of discriminating hexamers is 
designated (H = Hi, H2, ■ ■ ■ , iiiooo)- This procedure resulted 
in two hexamer-gene co-occurrence matrices, - one for the 'ts ' 
(or +1) class of dimension Ntrain.+i x 1000 and the other 
for the 'nts' (or —1) class - dimension Nt ra in,-i X 1000. 
Here N tra in,+i is the matrix of the 86 kidney-specific genes. 
Ntrain,-i is the set of 'nts ' that do not have kidney-specific 
expression. 

As an illustration, we show a representative matrix (Table. 

ID. 



Ensembl Gene ID 
ENSG00000155366 
ENSGOOO001780892 
ENSG00000189171 
ENSG00000168664 
ENSG00000160917 
ENSG00000176749 
ENSG00000006451 



AAAAAA 
1 
4 
1 
4 
2 
1 
3 



AAATAG 
1 
3 
2 
3 
1 
1 
2 



Class 
+1 
+ 1 
-1 
-1 
-1 
-1 
+ 1 



TABLE I 

The 'motif count matrix' for a set of gene-promoters. The first 
column is their ensembl gene identifiers, the next 2 columns 
are hexamer quantile labels, and the last column is the 
corresponding gene's class label ( + 1/ - 1). 



All the above steps, from sequence extraction, 
parsing and quantization to obtain hexamer-promoter 
counts that are done for the kidney-specific genes can 
be repeated for the histone-modified sequences. This 
dataset is obtained from the Sanger ENCODE database 
(http://www. sanger.ac. uk/PostGenomics/encode/data- 
access.shtml), and contains 298 sequences that undergo 
modification (ml/me3/ac) in histone ChIP assays. 140 
of these correspond to H3K4mel (enhancers), and 158 
correspond to HSKAmeZ/ HZac marks (promoters). Here, 
the 1000 hexamers discriminating H3K 4mel-sequences 
(+1 set) and a (HiKAmeZ/ H2>ac) (—1), are designated 
H' = H 1 ,H 2 , ■ ■ ■ , H 1Q00 . 

VI. Motif-Class Correspondence Matrices 

From the above, N tr ain.+i x 1000 and N tra in,-i X 1000 
dimensional count matrices are available both for the kidney- 
promoter and histone-modified sequences. Before proceeding 



Sequence AAAATA AAACTG Class 

chr2:41410492-41411867 2 1 +1 

chr6:41654502-41654782 4 2 +1 

chr3:41406971-41408059 1 1 -1 

chi2:41665970-41667002 2 3 +1 

chr4:41476956-41478365 1 2 -1 

chi5:41530471-41531046 2 2 -1 

chrX:41783327-41784532 1 2 +1 

TABLE II 

The 'motif count matrix' for a set of histone-modified 
sequences. the first column is their genomic locations along 
the chromosome, the next 2 columns are hexamer quantile 
labels, and the last column is the corresponding sequence 
class label ( + 1/ — 1). 



to the feature (hexamer motif) selection step, the counts of 
the M = 1000 hexamers in each training sample need to be 
normalized to account for variable sequence lengths. In the 
co-occurrence matrix, let gci k represent the absolute count 
of the k th hexamer, k € 1, 2, . . . , M in the i th gene. Then, 
for each gene gt, the quantile labeled matrix has Xi t k — I if 
9 c i r'-ijwi — 9°i,k < 9 c i [J-m]> K — 4- Matrices of dimension 
Ntrain,+i x 1001, Ntrain,~i x 1001 for the specific and 
non-specific training samples are now obtained. Each matrix 
contains the quantile label assignments for the 1000 hexamers 
(Xi, i € (1,2,..., 1000)), as stated above, and the last column 
would have the corresponding class label (Y = — 1/ + 1). 
Having constructed two groups of genes for analysis, tissue 
specific ('ts') and non-tissue specific ('nts') - we seek to 
find hexamer motifs which are most discriminatory between 
these two classes. Our goal would be to make this set of 
motifs as small as possible - i.e. to achieve maximal class 
partitioning with the smallest feature subset. Towards this goal, 
we explore the use of random forests (RF) [5] for finding such 
a discriminative hexamer subset. 

VII. Random Forest Classifiers 

A random forest (RF) is an ensemble of classifiers obtained 
by aggregating (bagging) several classification trees ([28], [5]). 
Each data point (represented as an input vector) is classified 
based on the majority vote gained by that vector across all 
the trees of the forest. Each tree of the forest is grown in the 
following way: 

• A bootstrapped sample (with replacement) of the training 
data is used to grow each tree. The sampling for boot- 
strapped data selection is done individually at each tree 
of the forest. 

• For an ill-dimensional input vector, a random subspace 
of m (<C M)-dimensions is selected, and the best split 
on this subspace is used to split the node. This is done 
for all nodes of the tree. Each tree is grown to maximum 
length, with no pruning. 

During the training step, before sampling by replacement, 
one-third of the cases is kept "out of the training bag". This 
oob (out-of-bag) data is used to obtain an unbiased estimate 
of the classification error as trees are added to the forest. It is 
also used to get estimates of variable importance. 
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From the above we see that the classifier structure of the 
random forest is an ensemble of trees. Each tree is trained 
and built on a different bootstrap sample (split) of the training 
data. Hence each tree has a different topology. Unlike a tree 
classifier, therefore, it is not possible to obtain a "consensus 
topology" of the RF classifier. In the absence of one unifying 
structure for the purpose of visualization, we can inspect the 
other outputs like variable importance, confusion matrix, and 
OOB error rate to ascertain the accuracy and performance of 
the RF classifier. 

The variables selected for optimal partitioning over class 
labels can be examined from a variable importance plot which 
indicates which variables are most discriminatory between 
these two classes ([5], [42]). It is also to be noted that random 
forests afford the dual advantage of both training and test-set 
error estimation (through the OOB data) during the overall 
training procedure. Thus there is no separate procedure for 
test-set error estimation that needs to be implemented in 
the case of RFs. Each tree in the ensemble is trained on a 
|rd — ^rd split of the data. Each tree is grown to get the 
least oob error before being incorporated into the classifier 
ensemble. 

A confusion matrix is one representative tool to understand 
the performance of the RF classifier. After the training pro- 
cess, the confusion matrix measures the discordance between 
true and predicted classes (and can be used for OOB error 
estimation). Each row represents the instances of the actual 
class, while each column of the matrix represents the instances 
in a predicted class. The matrix can then be used for false- 
positive, false-negative, true-negative and true-positive rate 
computations. 

Several interesting insights into the data are available using 
random forest analysis. The variable importance plot yields the 
variables that are most discriminatory for classification under 
the 'ensemble of trees' classifier. This importance is based on 
two measures- 'Gini index' and 'decrease in accuracy'. The 
Gini index is an entropy based criterion which measures the 
purity of a node in the tree, while the other metric simply looks 
at the relative contribution of each variable to the accuracy of 
the classifier. For our studies, we use the 'randomForest' pack- 
age for R [42]. The classifier performance on the individual 
data and the related diagnostics are mentioned under each head 
(Sees: EE] and EJ. 

VIII. Random Forests on Kidney-specific 

PROMOTERS 

In this section, we aim to find discriminating sequence mo- 
tifs between a set of kidney-specific promoters and housekeep- 
ing promoters with a goal to find sequence motifs underlying 
kidney-specific regulation. The kidney enriched dataset has 86 
genes that are assigned to a tissue specific class and have 
higher than mean expression in the kidney. For the purpose of 
training and testing, we consider the set of housekeeping genes 
identified from the 'nts' class and reported in literature ([16], 
[20]). There are almost 1500 genes in the housekeeping gene 
('nts') set. Since, this would lead to unbalanced predictions 
during classifier training, we use a stratified sampling approach 



[42] to select for a sample size that reduces this effect (the 
sampling itself is done with a prior on the relative sizes of 
the two classes). Here, the set of (—1) promoter-sequences 
are taken to be of the same size as the (+1) class. Using this 
approach, we obtain a training-error classification accuracy of 
> 95% on the kidney enriched tissue-specificity data set. 

Before proceeding to motif identification, it is necessary to 
check for possible sequence bias (GC composition) between 
the two classes of promoters (kidney-specific vs. housekeep- 
ing). Though there are several kinds of sequence bias [60], 
the composition bias is most closely related to this problem. 
If there is a significant bias, then the motifs turn out to be just 
GC rich sequences that are not very biologically informative 
[69] for regulatory potential. The GC composition of these 
two classes of sequences is represented in Fig. [3] We note 
that though only a subset of 'nts' gene-promoters were used 
during the RF analysis, we show the GC-composition for the 
entire class of 'nts' sequences for completeness. As can be 
seen, the average GC composition is the same. The ROC space 
representation and variable importance plot for the overall 
classification is indicated below (Fig. QT| and Fig. 2]). The 
confusion matrices are all explained in the context of the 
classifier combination in SectionlXll 
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Fig. 3. GC plots for sequence bias in kidney-specific vs. housekeeping 
promoters. 



To address a related question, we examine if the 
top ranked hexamers in the kidney dataset correspond 
sequence-wise to known transcription factor binding 
sites. Using the publicly available Opossum tool 
(http://www.cisreg.ca/cgi-bin/oPOSSUM/opossum/) or 
MAPPER (http://bio.chip.org/mapper), we found several 
interesting transcription factors to map to these motifs, 
such as Nkx, ARNT, c-ETS, FREAC4, NEAT, CREBP, 
E2F, HNF4A, Pax2, MSX1, SP1 several of which are 
kidney-specific. Though this is highly consistent with the 
tissue-specificity of the dataset, the functional relevance of 
these sites remains to be experimentally validated. 
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Fig. 4. Top hexamers which can discriminate between kidney-specific and 
house-keeping genes. 



Fig. 5. GC plots for sequence bias in H3K4mel histone sequences vs. 
H3K4me3 and H3ac sequences. 
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IX. RFS ON CHROMATIN-MODIFIED SEQUENCES 

We train a RF classifier on a set of 298 sequences from 
chromosome sequence that have varying histone modifications 
associated with them (namely, H3K4mel/me3, and H3ac 
), as mentioned in Section: HU These sequences had a high 
level of the corresponding histone-modification from ChIP 
experiments. The other regions that were assayed for but did 
not have high levels of modification are not considered in this 
analysis. These are derived from the HeLa cell line and are not 
necessarily context-specific for kidney development. However, 
given the widespread use of this cell line for transcriptional 
studies, we aim to find if the motifs associated with regulatory 
elements are indeed predictive of enhancer activity. 

Here too, we examine the GC-composition bias of these two 
sequence classes (Fig. [5]) and confirm that there is no such 
sequence bias that would skew the discovery and subsequent 
interpretation of these epigenetic motifs. 

The motifs obtained from the random forest analysis in- 
dicate the "sequence-preferences" of regulatory elements that 
are kidney-specific (Fig. |4| or nucleosome-free (Fig. [6]). For 
the kidney-specific case, the underlying caveat is that co- 
expression does not imply co-regulation; however we are 
only using the discovered motifs to understand the "sequence- 
preferences" of kidney-specific regulatory-regions [34] rather 
than using them for de-novo prediction of new genes that are 
regulated by the same transcriptional machinery. Most of the 
motifs do not overlap TFBS motifs and might be indicative of 
more interesting sequence properties. We analyze the perfor- 
mance of these classifiers on the 4 UG enhancers, mentioned 
previously. In both cases, UG2 — 4 are classified as kidney- 
specific enhancers, whereas UG1 is correctly classified as not 
being regulatory. Additionally, a control set of "promoter- 
independent" enhancers derived from the Mouse Enhancer 
database [55] was also classified as enhancers based on these 
chromatin signatures. This high prediction accuracy inspite 
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Fig. 6. Top hexamers which can discriminate between H3K4mel histone 
sequences vs. H3K4me3 and H3ac sequences. 



of non-specificity of cell context (HeLa cell line) is very 
interesting and has potentially high predictive value. This is 
explored further in Sec: [XH 

We now proceed to the mechanistic insight (based on TF 
effector identification and PPI) mentioned in Section. U to 
understand the behavior of putative regulatory elements. 

X. PPI BETWEEN PROMOTER AND ENHANCER TFS 

In order to understand the nature of distal interactions 
between the enhancer and promoter TFs (Fig. [2j, we decouple 
the overall regulation problem into three parts: 

1) Identification of putative TF effectors at the promoter 
(Section: IX-AI) . 

2) Identification of enhancer TFs (Section: IX-Bb . and 
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3) Examination of the interaction-graph formed between 
enhancer- TFs and promoter TFs (Section: IX-CI ). 



A. TF effector identification at Promoter and Enhancer 

Promoter TF identification: TFs that regulate basal tran- 
scription at the promoter can be identified from phylogenetic 
conservation or co-expression studies. In this approach, the 
promoter sequence (here, the Gata2 promoter) is aligned 
across multiple species and the TFBS motifs that are conserved 
in the multiple alignment are considered to be putative effec- 
tors of gene regulation. An additional step involves examining 
the promoters of all genes that are co-expressed in the same 
spatio-temporal manner as the gene of interest (e.g.: Gata2 
in the kidney). Such sequence-based approaches have been 
examined in literature ([45], [38], [49]). 

Since the list of putative TFs (identified above) that poten- 
tially bind at the promoter is still large, there have been efforts 
to incorporate gene-expression data to reduce the set of poten- 
tial TF effectors. In this scenario, if the gene corresponding 
to the conserved TF has a high expression-level influence on 
Gata2 expression, then that TF has stronger evidence for being 
a potential regulator ([46], [41]). Recently, we introduced the 
directed information (DTI) as a metric to infer expression- 
level influence between any putative transcription factor (TF) 
gene and a target gene (such as Gata2) [59]. We will briefly 
summarize the utility of DTI for TF effector identification in 
the following sections (Sec. IX-A. 1 1 and |X-A.2t . This seeks to 
integrate sequence and expression data into the determination 
of relationships between transcription factors and their target- 
genes. All additional details (performance on synthetic data, 
other biological data and comparison with other metrics) are 
available in [59]. Information-based measures have enabled the 
investigation of non-linear gene relationships in the presence 
of measurement noise [46]. An important point to note is that 
unlike mutual information, the DTI is a directed metric that 
enables the inference of both strength and direction of gene 
influence. 




k 




L A 


nil 











Fig. 7. TFBS conservation between Human, Mouse and Rat, upstream (x- 
axis) of Gatal, from http://www.ecrbrowser.dcode.org/. The mouse sequence 
is the base sequence and is hence not displayed. The dark and light red regions 
correspond to potential TF binding regions on DNA. 



1 ) DTI Formulation: As alluded to above, there is a need 
for a viable influence metric that can find relationships be- 
tween the TF "effector" gene (identified from phylogenetic 
conservation) and the target gene (like Gata2). Several such 
metrics have been proposed, notably correlation, coefficient 
of determination (CoD), mutual information etc. To alleviate 
the challenge of detecting non-linear gene interactions, an 
information theoretic measure like mutual information has 
been used to infer the conditional dependence among genes 
by exploring the structure of the joint distribution of the gene 
expression profiles [46]. However, the absence of a directed 
dependence metric has hindered the utilization of the full 
potential of information theory. In this section, we examine 
the applicability of one such metric - the directed information 
criterion (DTI), for the inference of non-linear, directed gene 
influences. 

The DTI - which is a measure of the directed dependence 
between two A^-length random processes X = X N and Y = 
Y N is given by [48]: 



N 



I(X N ->Y N ) = ^I{X n -Y n \Y n ~ 1 ) 



(1) 



Here, Y n denotes (Y{, Yz, .., Y n ), i.e. a segment of the 
realization of a random process Y and I(X N ;Y N ) is the 
Shannon mutual information [10]. 

An interpretation of the above formulation for DTI is in 
order. To infer the notion of influence between two time 
series (mRNA expression data) we find the mutual information 
between the entire evolution of gene X (up to the current 
instant n) and the current instant of Y (Y n ), given the evolution 
of gene Y up to the previous instant n — 1 (i.e. F" _1 ). This 
is done for every instant, n 6 (1, 2, ... , N), in the N - length 
expression time series. 

As already known, I(X N ;Y N ) = H(X N ) - H(X N \Y N ), 
with H(X N ) and H(X N \Y N ) being the entropy of X N and 
the conditional entropy of X N given Y N , respectively. Using 
this definition of mutual information, the DTI can be expressed 
in terms of individual and joint entropies of X N and Y N . The 
task of A^-dimensional entropy estimation is an important one 
and due to computational complexity and moderate sample 
size, histogram estimation of this multivariate density is un- 
viable. However, several methods exist for consistent entropy 
estimation of multivariate small sample data ([18], [50], [53], 
[71]). In the context of microarray expression data, wherein 
probe-level and technical/biological replicates are available, 
we use the method of [18] for entropy estimation. 

From (1), we have, 

JV 

I(X N Y N ) = ^[iJpriF"- 1 ) - H(X n \Y n )} 



n=l 



N 



^{[^(x n ,F n - 1 )- J fj(y n - 1 )]- 

H(X n ,Y n ) - H(Y n )]} 



n=l 



(2) 



To evaluate the DTI expression in Eqn.2, we need to 
estimate the entropy terms H(X n , Y n ^ 1 ), 
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H(X n , Y n ) and H(Y n ). This involves the estimation of 
marginal and joint entropies of n random variables, each 
of which are R dimensional, R being the total number 
of replicates (probe-level, biological and technical). 

• Though some approaches need the estimation of probabil- 
ity density of the i?-dimensional multivariate data (X n ) 
prior to entropy estimation, one way to circumvent this 
is to the use the method proposed in [18]. This approach 
uses a Voronoi tessellation of the i?-dimensional space 
to build nearly uniform partitions (of equal mass) of the 
density. The set of Voronoi regions (V 1 , V 2 , . . . , V n ) for 
each of the n points in i?-dimensional space is formed 
by associating with each point Xu, a set of points V k 
that are closer to Xk than any other point Xi, where the 
subscripts k and I pertain to the k th and I th time instants 
of gene expression. 

• Thus, the entropy estimator is expressed as : H(X n ) — 
A YTi=i log(nA(T/ 2 )), where A{V l ) is the i?-dimensional 
volume of Voronoi region V 1 . A(V l ) is computed as the 
area of the polygon formed by the vertices of the convex 
hull of the Voronoi region V 1 . This estimate has low 
variance and is asymptotically efficient [19]. 

To obtain the DTI between any two genes of interest 
(X and Y) with iV-length expression profiles X N and Y N 
respectively, we plug in the entropy estimates computed above 
into the expression (2). 

From the definition of DTI, we know that < I(X^ — > 
Y N ) < I(Xi, Y N ) < oo. For easy comparison with other 
metrics, we use a normalized DTI metric given by poi = 
y/l - e -2/(x«^Y«) = Vl-e^Si 7 ^^! 7 '" 1 '. This 
maps the large range of DI, ([0, oo]) to lie in [0, 1]. Another 
point of consideration is to estimate the significance of the 
'true' DTI value compared to a null distribution on the DTI 
value (i.e. what is the chance of finding the DTI value by 
chance from the series X and Y). This is done using empirical 
p-value estimation after bootstrap resampling (Sec: IX-A.21 i. A 
threshold p-value of 0.05 is used to estimate the significance 
of the true DTI value in conjunction with the density of a 
random data permutation, as outlined below. 

2) Significance Estimation of DTI: We now outline a 
procedure to estimate the empirical p-value to ascertain the 
significance of the normalized directed information I(X — > 
Y N ) between any two A^-length time series X = X N = 
(Xx,X 2 ,..., Xn), and Y = Y N = (Y U Y 2 , Y N ). In our 
case, the detection statistic is = I(X N — > Y N ) and the 
chosen acceptable p-value is a. 

The overall bootstrap based test procedure is ([15], [57], 
[25]): 

« Repeat the following procedure B{= 1000) times (with 
index 6 = 1,..., B): 

- Generate resampled (with replacement) versions of 
the times series X N , Y N , denoted by , Y b N 
respectively. 

- Compute the statistic 9 b = I(X^ -> Y b N ). 

• Construct an empirical CDF (cumulative distribution 
function) from these bootstrapped sample statistics, as 
F e (0) = P(6 < 9) = A Ef=i 4>o(z - - 9 b ), where 



/ is an indicator random variable on its argument x. 

• Compute the true detection statistic (on the original time 
series), 9 = I(X N — > Y N ) and its corresponding 
p-value (po = 1 — -FeC^))) un der the empirical null 
distribution -Fo(#). 

. If F e (6> ) > (1 - a), then we have that the true DTI 
value is significant at level a, leading to rejection of null- 
hypothesis (no directional association). 
3) Summary of DTI-based TF effector Inference: Our pro- 
posed approach using DTI for determining the effectors for 
gene B (Gata2 in the enhancer study) is as follows: 

• Identify the G genes (A\, A 2 , , . . , Aq), based on phy- 
logenetic conservation (Fig. [7]). Preprocess the gene 
expression profiles by normalization and cubic spline 
interpolation. Assuming that there are N points for each 
gene, entropy estimation is used to compute the terms in 
the DTI expression (Eqn. 2). 

• For each pair of genes A; and B among these G genes : 

{ 

- Look for a phylogenetically conserved binding site 
of TF encoded by gene A; in the upstream region of 
gene B. 

- Find DTI{Ai,B) = I(A? -> B N ), and the 
n ormalized DTI from A to B, DTI(Ai,B) = 

^/l _ e -2I(A?^B")^ 

- Bootstrap resampling over the data points of A an d 
B yields a null distribution for DTI(Ai,B). If the 
true DTI(Ai, B) is significant at level a with respect 
to this null histogram, infer a potential influence from 
A to B. 

- The value of the normalized DTI from A to B gives 
the putative strength of interaction/influence. 

- Every gene Aj which is potentially influencing B is 
an 'effector'. This search is done for each gene A 
among these G genes (Ai, A 2 , ■ ■ ■ , Ag)- 

} 

Note: As can be seen, phylogenetic information is inherently 
built into the influence network inference step above. We note 
that, in this approach, the choice of potential effectors for a 
target gene is based on only those TFs that have a binding site 
at the target gene's promoter. This aims to reduce the overall 
search space based on biological prior knowledge. 

As an example, we indicate the significance and strength of 
the DTI between the Pax2 TF and Gata2. The high strength 
of influence and its significance coupled with the phylogenetic 
conservation of the Pax2 motif indicates expression evidence 
for the role of Pax2 in Gata2 regulation ([7], [14]). 

Such analysis can be extended to all TFs that are phylo- 
genetically conserved. For Gata2 regulation in the developing 
kidney, this set of putative TF effectors (apart from Pax2) is 
shown in Fig. [9] 

B. Enhancer TF identification 

In the earlier section, we have examined the identification 
of promoter TFs using phylogenetic sequence conservation of 
TFBS motifs in conjunction with expression level influence 
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Fig. 8. Cumulative Distribution Function for bootstrapped I(Pax2 —* Gata2) 
interaction. True I(Pax2 -> Gata2) = 0.9818. 




Fig. 9. Putative upstream TFs using DTI for the Gata2 gene. 



using DTI. The next key step towards determining the nature 
of promoter-enhancer TF interactions is the identification of 
enhancer- TFs. As has been alluded to earlier, there is no 
method to precisely infer which transcription factors bind a 
certain regulatory element during long-range gene regulation. 
Thus, we appeal to a traditional approach of finding tissue- 
specific transcription factors that are phylogenetically con- 
served at any potential regulatory region ([56], [70]). This 
is consistent with earlier observations that enhancers recruit 
tissue-specific transcription factors during the formation of 
the overall transcriptional machinery during gene expression, 
whereas promoters recruit components of the basal transcrip- 
tional machinery ([35], [45], [38], [70], [66]). 

To ascertain the tissue-specificity of each TF that putatively 
binds a regulatory element (identified via phylogenetic con- 
servation), we examine that TF's annotation in the UNIPROT 
database. This database is one of the most current sources 
of TF annotation and has details pertaining to the sequence 
specificity of the binding motif, the structure of the TF and 
its tissue-specificity of expression. For those TFs that do not 
have a UNIPROT annotation, we look at the tissue-expression 
of the corresponding gene from the mouse genome informatics 
(MGI) mRNA annotations. The MGI expression annotations 
encompass multiple modalities (literature, RNA in-situ) to sug- 
gest a tissue-restricted or conversely, a ubiquitous expression 
of the TF gene. Thus, a set of tissue-specific transcription 
factors that bind any non-coding region of interest (such as an 
enhancer) can be identified ([52], [70], [56], [38], [45]). For 
the Gata2 UGEs, several potential TFs can be found, some of 
which are highlighted in Fig. [10] 



C. Enhancer-Promoter Distal Interaction via Protein-Protein 
Interactions - A Graph Based Analysis 

Using the notion of protein-protein interaction mediating 
long-distance interactions between promoters and enhancers 
during looping ([54], [4], [26]), we explore the interactome 
to look for within-group and between-group interactions in 
the promo ter-TF and the enhancer-TF groups. The resultant 
interaction-graph can be examined for several "structural" 
characteristics (like heterogeneity, degree distribution, path 
length, density, clustering coefficient and connected compo- 
nents) ([2], [12]). The goal is to identify structural features that 
discriminate true enhancer vs. non-functional element activity 
based on their interaction-graph. 

The interaction-graphs (e.g: Fig. ITOb are obtained in the 
following manner: 

• One part of the graph (hollow circles) corresponds to 
the TF effector group at the promoter. These V p TFs 
are identified based on phylogenetic conservation and 
directed information (section: IX-Al i. 

• The other part of the graph (filled circles) corresponds to 
the V e tissue-specific TFs group at the enhancer, identified 
based on phylogeny and annotation (section: IX-Bt . 

• The interaction-graph is defined by the vertices V — 
(V p U V e ), and the edges E = e itj , i,j G (1,2, ... , \V P U 
V e \). Each bidirectional edge E — (e^j) is derived 
from an annotated interaction between TFs i and j, 
based on an interaction database. These edges de- 
scribe both within-group TF interactions as well as 
between-group interactions. To obtain the TF interac- 
tions, we use protein-interaction information derived 
from the STRING (http://string.embl.de/) and MiMI 
(http://mimi.ncibi.org/MiMI/home.jsp) databases, both of 
which contain data derived from multiple sources, such 
as yeast-2-hybrid screens, literature, ChIP etc. Though 
there is some inherent noise in the accuracy of these high- 
throughput sources, they permit the use of a confidence 
threshold to discriminate a potentially true interaction 
from a spurious one. 

Though it would be of great value to use a catalog of gene- 
specific and tissue-specific regulatory regions (with all possible 
transcription factors) from which to find such interaction 
characteristics - such a repository does not yet exist. In this 
section, we use a few examples (Gata3 OVE, Gata3 KE, Fgf 
OVE, Mecp2 F21/F6 , Shh FE) of known tissue-specific and 
gene-specific regulatory elements from literature, as a positive 
training set. For the negative training set, we consider the set 
of regions that were reportedly investigated in these transgenic 
experiments but did not yield gene-specific regulatory activity. 
Based on which structural metrics are associated with potential 
regulatory activity for these examples, we will examine if these 
features are predictive of Gata2 UGE enhancer behavior, from 
its interaction-graph. 

We have presented a preliminary analysis of enhancer- 
promoter TF interaction-graphs for some genomic elements 
with known regulatory or non-regulatory activity ([44], [40], 
[27], [52]) in Table. Hill The table represents the listing of 
the structural attributes of these interaction-graphs, following 
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analysis methods from literature ([3], [2], [61]). A brief 
summary of these attributes are given below. A deeper analysis 
of other graph topology metrics and their relation to functional 
enhancer activity is a topic of future interest. 



Sequence 


Class 


Clustering 


Characteristic 


Heterogeneity 






Coefficient 


path length 




Mecp2 F21[44] 


+ 1 


0.208 


2.824 


0.668 


Mecp2 F6 [44] 


-1 





1.75 


0.342 


Gata3 OVE [27] 


+ 1 


0.036 


2.254 


0.779 


Gata3 KE [27] 


+ 1 


0.409 


2.0 


0.813 


Gata3 NE1 [27] 


-1 


0.383 


2.131 


1.139 


Gata3 NE2 [27] 


-1 


0.458 


2.013 


0.872 


FgflO OVE [52] 


+ 1 


0.313 


2.433 


0.72 


Shh FE [40] 


+ 1 


0.394 


2.312 


0.797 



TABLE III 

The first column is the various regulatory and 
non-regulatory elements from literature, the next column 
corresponds to its class label (+1/ — 1). the subsequent 
columns correspond to the attributes of the overall 
tf-inter action graph (both within-group and between-group 
interactions). 



Clustering coefficient: In undirected networks, the clus- 
tering coefficient C n of a node n is defined as C n = 
2e n /[fc n (fc„ — 1)], where k n is the number of neighbors 
of n and e„ is the number of connected pairs between 
all neighbors of n. Thus C n , of a node in a graph is 
the ratio of the number of edges between the neighbors 
of that node over the total number of edges that could 
exist among its neighbors. The clustering coefficient of a 
node is always a number between and 1. The network 
clustering coefficient is the average of the clustering 
coefficients for all nodes in the network. 
Characteristic Path length: The length of a path along 
the graph is the number of hops (or edges) between 
any two nodes along the graph. Though, there may be 
multiple paths between two nodes n and m (TFs) along 
the interaction-graph , the shortest path length L(n, m) = 
(L(m,n)) corresponds to the minimum across these 
multiple paths. This measure is computed for all pairs 
of nodes in the network. The characteristic path length 
denotes the average shortest-path distance of the graph. 
This gives the expected distance of any two connected 
nodes in the graph and is a global indicator of network- 
connectivity. 

Heterogeneity: Network heterogeneity denotes the coef- 
ficient of variation of the degree distribution. A network 
that is heterogeneous would consist of some nodes that 
are highly connected (exhibit 'hub' behavior), while the 
majority of nodes tend to have very few connections. Un- 
derstanding the heterogeneity of the degree distribution 
in biological networks is an interesting topic of current 
research, especially as a way to discover modularity [12]. 
Centralization: This refers to the overall connectivity 
(cohesion) of the graph. It indicates how strongly the 
graph is organized around its most central point(s). The 
central point(s) of the graph are the set of nodes which 
minimize the maximum distance distance from all other 



nodes in the graph [67]. Networks whose topologies 
resemble a star/wheel pattern have a centralization close 
to 1, whereas decentralized networks are characterized by 
having a centralization close to 0. 
• Density: The neighborhood of a given node n is the set 
Centra|jfatjgn neJgff&Srs. The connectivity of n, denoted by k n , 
184 is the siz^^ its neighborhood. The average number of 
0.067 neighbors, lirgdicates the average connectivity of a node 
0.359 i n the network. A normalized version of this parameter 
0757 is the nej;w£>rk density k n /n(n — 1). The density is a 
0.699 value betmagn and 1. It is also the average standardized 

0. 323 degree. F? y5ows how densely the network is populated 
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with edge's (i.e. how "close-knit" an empirical graph is 
[67], [63]). A network which contains no edges and solely 
isolated nodes has a density of 0, whereas the density of 
a clique (completely connected graph) is 1. 
The above mentioned several network properties (as well as 
clustering coefficients, number of connected components etc.) 
are examined for the overall interaction-graphs for the reported 
enhancers from literature [2]. A logistic regression as well as 
random forest analysis reveals that low values of heterogeneity, 
characteristic path length and centralization are fairly good 
predictors of potential enhancer activity. All of these attributes 
point to the decentralized, homogenous and somewhat tighter 
connectivity of the interaction-graphs for true enhancers. We 
note that the OOB error rate of the RF here is about 25%. The 
quality of this classifier can be expected to improve as we get 
more data from which to extract features. 

We now examine the interaction-graphs for the test set, 

1. e. the four Gata2 UGEs. For illustration, we only show the 
largest connected component of the inter-group edges for each 
interaction graph (Fig. ITOb . 




UG1 






UG2 



UG4 



UG3 



Fig. 10. Protein-protein interaction between putative Gatal TFs (hollow 
circles) and putative UG element TFs (filled circles). Note: This only shows 
the connections between two groups for one of the connected components. 
For our analysis, we consider both intra- and inter-group connections. From 
http://string. embl. de/ 

This figure indicates a very interesting property of the real 
enhancers vis-a-vis the other conserved elements. We see that 
the TF effectors for Gatal such as SP1, POU3F2 (identified in 
the TF effector network above, Fig. [9), are involved in cross- 
element interactions at the protein level, between the promoter 
and true enhancer (UG2/A). However, the network linkage in 
the elements that showed no enhancer activity is very sparse 
suggesting low cross-talk between promoter and enhancer. 
Also, the TFs at the enhancer nodes (dark circles) have a more 
uniform degree distribution in the functional elements U G2 /4 
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as compared to the non-functional ones. Both these observa- 
tions suggest lower heterogeneity and centralization of such 
functional interaction-graphs. Thus, the extent of TF cross- 
talk is a potential discriminator of possible enhancer function. 
This shows that superimposing PPI information along with 
sequence and expression data helps reduce the number of false 
positives while integrating various aspects of distal regulation. 

XI. Heterogeneous Data Integration and 
Validation on Gata2 UGEs 

As mentioned previously, the primary goal of the vari- 
ous methods developed above is to understand the behavior 
of known transcriptional elements along different genomic 
modalities. To validate their predictive potential, we have to 
demonstrate their application to predicting the behavior of 
the Gata2 UGEs (which is our test set). In this section, we 
present a framework that combines the results of the individual 
classifiers developed before (kidney-promoter RF, histone RF 
and interactome-RF) to obtain a integrated prediction. For 
combining heterogeneous classifiers, we will explore a "prob- 
abilistic belief fusion" framework in this paper. Of course, 
other techniques from literature (like ensemble methods) are 
also highly amenable for exploration in this context. 

The framework involves combining the 'beliefs' of the indi- 
vidual classifiers to obtain a combined belief of prediction. To 
compute the belief of each classifier we start with examining 
the confusion matrices for each of the classifiers (kidney- 
promoter RF, histone-RF and graph-RF), following ([24], [72], 
[8]). Since each of the classifiers are random forests, we 
can obtain their OOB error estimates through these confusion 
matrices. For the graph-RF, this confusion matrix is as below, 



Rendering of tr 



sifiers in ROC 5 



CM 



graph — RF 



'Class —1 1 class.error^ 
-1 4 1 0.20 
1 1 4 0.20 



thereby yielding an OOB error estimate of ~ 20%. 
Similarly, we have, 

(Class —1 1 class. error\ 
-1 67 19 0.22 
1 10 76 0.12 J 

thus yielding an OOB error estimate of ~ 17%. 



CM histone _ RF 




1 class.error^ 
24 0.15 
119 0.15 



yielding an OOB error estimate of ~ 15%. 

The three random forest classifiers are represented in ROC 
space (Fig. [TTV As can be seen, these three classifiers have 
fairly good performance characteristics. Moreover these are 
three complementary data sources and can be effectively 
combined to improve detection reliability. Since they are 
trained on very different modalities, they can be assumed to 
be independent. 

Each classifier is a function ek(x) = j k that maps a data 
point (x) to the class 'j', with k = 1,2, ... ,K and jk £ 
(— 1, 1). Here, K — 3, and J = 2 classes. 



- graph-RF 



Fig. 11. Representation of the three RF classifiers in ROC space (RF- 
promoter in (+), and RF-histone in (o), and graph-RF in (x)). The diagonal 
line is the classification by random chance. 



Thus, the belief of the k th classifier is, 

bel k (x £ Ci\e k {x) = j k ) = P(x £ C t \e k (x) = j k ) 
The overall belief, bel( i), given by, 

bel(i) = bel(x £ d\ei(x) = ji , . . . , e K (x) = Jk) = 
P(x £ Ci\ex(x) = ji, . . .,e K (x) = j K ) 
= P{ei(x) =ji,...,e K (x) =j K \x £ d).P(x £ d) 
P{ei(x) = ji,...,e K (x) = Jk) 
Further, we have that, 

PL P(e k (x) = Jk \x £ d) _ Y\Li P(x £ d\e k (x) = j k ) 

n^i^(e*(x)=ifc) nL^e^) 

Thus, 



IlfcLi P(x £ Cj\e k (x) = j k ) 



bel(i) = P(x G Q). 



nti P(x G d) 
(due to independence of the K classifiers,) 



In the absence of the posterior probability P(x G Ci), an 
approximation is used, leading to [72], 

Y[ k K = 1 P{x&C i \e k {x)=jk) 



bel(d 



Y,LiIlLi P (x£C i \ek(x)=jk) 



Note: J = 2 and K = 3. Depending on the belief value bel(i), 
the decision rule (E(x)) for classifying data point x is, 

E(x) = j, if bel(j) — maxi bel(i), 
or, E(x) — j, if bel(j) — maxi bel(i), and, bel(j) > a, 

where < a < 1, with a being a threshold. 

We now show the output classes of each of the 3 classifiers 
as well as the combined belief on the Gata2 UGEs in Table. 
HVI More specifically, for the first row in Table. HVl the overall 
belief equation above becomes, 



bel(ugl = +1) = 



Pjugl = +l\e 1 (x) = -l).P(ugl = +l|e 2 (x) = 
P(ugl = +l|ei(x) = -l).P(ugl = +l|e 2 (a;) = - 
P(ugl = -l|ei(a;) = -l).P{ugl = -l|e 2 (x) 
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Sequence True Promoter RF Histone RF Interaction-graph RF P(Class=+l) 

Class prediction e\(x) prediction e<i{x) prediction e^(x) (Overall Belief) 

Gata.2 UG1 -1 -1 -1 -1 0.0054 

Gata2 UG2 +1 +1 +1 +1 0.9875 

Gata2 UG3 -1 +1 +1 -1 0.832 

Gata2 UG4 +1 +1 +1 +1 0.9875 

TABLE IV 

Combined belief generation during heterogeneous classifier integration. The last column represents the combined belief 
(probability that the ug sequence is an enhancer) as a result of integrating the promoter-rf, histone-rf and graph-rf 

predictions. 



[(1 - precox) x (1 -prec n .i)\ x [(1 - prec n ^\ 



ppnnmp-wiHp It is to be noted that this model is data driven 



Here, prec n 



(l-prec nA ) x (1 -prec na ) x (1 - prec„ ;3 )] + [prec nA x pr%M2rihf r rffin directly correS p nd to the biology of transcrip- 

TP k tion. However, much like markov models for gene sequence 
annotation, we believe that such data-driven models are useful 
for model-building during genome-wide study. 



Similarly, prec Pt k 



- TNk + FNk - F>^p,k - T P k +FP k - 

These are the negative and positive precision values respec- 
tively, for the k th classifier. These rates are obtained from the 
corresponding confusion matrices shown above. This approach 
is followed for each of the UG1 — 4 elements. 

If we set a threshold of a = 0.85 or 0.90, we would get 
UG2 and UGA to be the true enhancers (100% accuracy). 
However, for a choice of a = 0.8, UG3 is predicted to be an 
enhancer in spite of being declared a member of the (—1) class 
by the graph-RF. This choice of threshold thus determines the 
performance of the combined classifier. 

Under the a = 0.8 case, however, the results are not to be 
interpreted as a 25% error rate since the nature of the test set 
(Gata2 UG enhancers) are very different from the training data 
of each modality (promoters are proximal elements whereas 
enhancers are distal; histone sequences are for a different 
cell-context; and interaction-graphs are obtained over different 
genes). The fact that we are getting such good prediction in 
spite of the training sets being so different is a strong point 
in favor of examining and integrating these data sources. The 
real test-error rates are given by the OOB error estimates of 
the individual classifiers. 

XII. Summary of Approach 
In this work, we have shown that, 

• Motif signatures are predictive of regulatory element 
location. These comprise sequence-motifs derived from 
tissue-specific gene promoter sequences as well as se- 
quences related to epigenetic preferences during gene 
regulation. 

« Promoter and enhancer TFs that are putatively recruited 
during gene (Gata2) regulation can be identified using 
a combination of phylogenetic conservation, expression 
data, and tissue-specificity annotation. 

• Effector TFs (via DTI) at the gene proximal promoter 
have high network linkage with enhancer TFs in case 
of functional enhancers. The TF interaction-graphs of 
truly functional elements are seen to be have a lower 
centralization, characteristic path length and heterogene- 
ity suggesting higher cross-talk during formation of the 
transcription factor complex. 

These perspectives (based on sequence, expression and 
interactome data) shed some light on the sequence and mech- 
anistic preferences of true regulatory regions interspersed 



XIII. Conclusions 

In this work, we have examined the problem of regulatory 
element identification. Such an effort has implications to 
understand the genomic basis of key biological processes such 
as development and disease. Using the biophysics of transcrip- 
tion, this can be modeled as a problem in data integration over 
various experimental modalities such as sequence, expression, 
transcription factor binding and interactome-data. Using the 
case study of enhancers corresponding to the Gata2 gene, we 
examine the utility of these heterogeneous data sources for 
predictive feature selection, using principled methodologies 
and metrics. 

Based on motif signatures, we find that they predict the 
true enhancers (U G2, U G4), and the false enhancer UG1, but 
mispredict U G3 to be an enhancer. However, a mechanistic in- 
sight that analyzes enhancer behavior based on the interactions 
between distally and proximally recruited transcription factors 
can greatly improve on prediction accuracy. Additionally, 
combining heterogeneous classifiers based on multiple data 
modalities yields an improved accuracy of prediction. 

The novelty of the proposed work spans several areas. 
Firstly, data sources that are relevant to understand the mech- 
anism of gene regulation (with Gata2 as an example) have 
been identified. We have developed methods that reconcile 
the behavior of known regulatory elements along each of 
these modalities. The kidney-promoter based classifier aims to 
discover sequence preferences of kidney-specific regulatory re- 
gions. The utilization of histone-modified sequences and their 
exploration for sequence motifs are indicative of epigenetic- 
preferences and nucleosome-occupancy patterns. This has not 
been explored before in the realm of LRE characterization. 
The use of DTI as a metric to infer putative TF to target-gene 
influence is a recent one that serves to integrate phylogenetic 
TFBS conservation along with expression data. Finally, the 
utilization of graph-based analysis techniques to understand 
the "structure" of the TF interaction-graph between enhancer 
and promoter helps us understand true enhancer behavior from 
a mechanistic viewpoint. The probabilistic combination of 
multiple classifiers (each deriving from a unique data resource) 
aims to reconcile the behavior of existing enhancers along 
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multiple modalities. We hope to demonstrate that a principled 
integration of non-overlapping genomic modalities can be used 
to interpret the context and specificity of gene regulation. 

XIV. Future Work 

Some key elements directly emerge for guiding future re- 
search. As already alluded to in the motif-signature procedure, 
specific expression data corresponding to stages and tissues 
of interest would greatly improve the specificity of regulatory 
element prediction. Furthermore, as histone modification maps 
for different cell lines are generated, the false positive rate 
of prediction would decrease, thereby improving accuracy. 
Several other learning paradigms can be introduced into this 
setting, since we are learning from structured data. Also, 
methods in joint classifier and feature optimization might 
likely improve the accuracy of predictions. Additionally, meth- 
ods that analyse the grammar of these cis-regulatory regions 
(LREs) and look for motif position, spacing and orientation 
will be of great utility. 

At the expression level, methods for supervised network 
inference would have a great impact on the discovery of TF 
effectors. Rapid advances have been made in this area and their 
relevance to the biological context of the problem has become 
very principled. At the interactome level, the work presented 
here can be extended to the investigation of graph-clusters for 
weighted interaction-graphs. The weighted edges are obtained 
from the confidence of the individual data sources, as well 
as the number of species over which that particular edge is 
conserved ([3], [65]). Such analysis enables the discovery of 
subgraphs of various degrees of inter-connectedness, thereby 
discovering functional "graph-motifs". 
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